CombiTagger: A System for Developing Combined Taggers

Authors

  • Verena Henrich
  • Timo Reuter
  • Hrafn Loftsson
Abstract

The main task of part-of-speech (PoS) tagging is to assign the appropriate morphosyntactic category to each word in a sentence. A combination of different PoS taggers usually results in higher tagging accuracy than that obtained by a single tagger. We present a new language- and tagset-independent system, CombiTagger, which automatically combines the output of several taggers. The system, which is open source, provides algorithms for simple and weighted voting, and it is extensible so that other combination algorithms can be added easily. We demonstrate the functionality of CombiTagger by using it to develop and evaluate combined taggers for Icelandic. The most accurate individual tagger obtains an accuracy of 91.83%. CombiTagger achieves 93.09%-93.41% accuracy by combining the output of five or six taggers using simple and
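The two combination strategies named in the abstract, simple and weighted voting, can be illustrated with a minimal Python sketch. This is not CombiTagger's own code: the function names `vote` and `combine`, the toy tags, the example weights, and the tie-breaking rule (prefer the tagger listed first) are all assumptions made for illustration.

```python
from collections import defaultdict


def vote(proposals, weights=None):
    """Choose one tag for a token from the tags proposed by several taggers.

    proposals: one tag per tagger, in a fixed tagger order.
    weights:   one weight per tagger; None means simple voting (every vote counts 1).
    Ties go to the tagger listed first (an assumption, not taken from the paper).
    """
    if weights is None:
        weights = [1.0] * len(proposals)
    if len(weights) != len(proposals):
        raise ValueError("need exactly one weight per tagger")

    # Sum the weight behind each distinct tag.
    scores = defaultdict(float)
    for tag, weight in zip(proposals, weights):
        scores[tag] += weight

    # Return the first proposed tag that reaches the highest score.
    best = max(scores.values())
    for tag in proposals:
        if scores[tag] == best:
            return tag


def combine(tagger_outputs, weights=None):
    """Combine tag sequences (one per tagger, all the same length) token by token."""
    return [vote(list(column), weights) for column in zip(*tagger_outputs)]


if __name__ == "__main__":
    # Hypothetical output of three taggers for a three-token sentence.
    outputs = [
        ["DT", "NN", "VB"],
        ["DT", "VB", "VB"],
        ["DT", "NN", "NN"],
    ]
    print(combine(outputs))                            # simple voting
    print(combine(outputs, weights=[0.2, 0.9, 0.3]))   # weighted voting, made-up weights
```

With simple voting the majority tag wins for each token; with weights, a single heavily weighted tagger can overrule two weaker ones, as happens for the second token when the second tagger carries weight 0.9.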


Similar resources

Something Borrowed, Something Blue: Rule-based Combination of POS Taggers

Linguistically annotated text resources are still scarce for many languages and for many text types, mainly because their creation represents a major investment of work and time. For this reason, it is worthwhile to investigate ways of reusing existing resources in novel ways. In this paper, we investigate how off-the-shelf part of speech (POS) taggers can be combined to better cope with text m...


Hybrid Techniques for Training Hmm Part-of-speech Taggers

We describe and experimentally evaluate a hybrid technique for training part-of-speech taggers which utilises training from small quantities of unambiguously-tagged material combined with maximum likelihood re-estimation over the target untagged corpus. This approach, unlike previous ones employing re-estimation, does not involve skilled manipulation of the initial parameters of the model or th...


Building Domain-Specific Taggers without Annotated (Domain) Data

Part of speech tagging is a fundamental component in many NLP systems. When taggers developed in one domain are used in another domain, the performance can degrade considerably. We present a method for developing taggers for new domains without requiring POS annotated text in the new domain. Our method involves using raw domain text and identifying related words to form a domain specific lexico...


Tagging the Past: Experiments using the Saga Corpus

There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagg...


Writing Annotation Instructions

In two corpus annotation projects, we followed similar strategies for developing annotation instructions and obtained good inter-coder reliability results for both (the instructions are similar in style to Allen & Core 1996). Our goal in developing the annotation instructions was that they can be used reliably, after a reasonable amount of training, by taggers who are non-experts but who have g...




Publication date: 2009